Identifying animal behaviours from accelerometers: Improving predictive accuracy of machine learning by refining the variables selected, data frequency, and sample duration

Abstract Observing animals in the wild often poses extreme challenges, but animal‐borne accelerometers are increasingly revealing unobservable behaviours. Automated machine learning streamlines behaviour identification from the substantial datasets generated during multi‐animal, long‐term studies; however, the accuracy of such models depends on the qualities of the training data. We examined how data processing influenced the predictive accuracy of random forest (RF) models, leveraging the easily observed domestic cat (Felis catus) as a model organism for terrestrial mammalian behaviours. Nine indoor domestic cats were equipped with collar‐mounted tri‐axial accelerometers, and behaviours were recorded alongside video footage. From this calibrated data, eight datasets were derived with (i) additional descriptive variables, (ii) altered frequencies of acceleration data (40 Hz vs. a mean over 1 s) and (iii) standardised durations of different behaviours. These training datasets were used to generate RF models that were validated against calibrated cat behaviours before identifying the behaviours of five free‐ranging tag‐equipped cats. These predictions were compared to those identified manually to validate the accuracy of the RF models for free‐ranging animal behaviours. RF models accurately predicted the behaviours of indoor domestic cats (F‐measure up to 0.96) with discernible improvements observed with post‐data‐collection processing. Additional variables, standardised durations of behaviours and higher recording frequencies improved model accuracy. However, prediction accuracy varied with different behaviours, where high‐frequency models excelled in identifying fast‐paced behaviours (e.g. locomotion), whereas lower‐frequency models (1 Hz) more accurately identified slower, aperiodic behaviours such as grooming and feeding, particularly when examining free‐ranging cat behaviours. 
While RF modelling offered a robust means of behaviour identification from accelerometer data, field validations were important for confirming model accuracy for free‐ranging individuals. Future studies may benefit from employing similar data processing methods that enhance RF behaviour identification accuracy, with extensive advantages for investigations into ecology, welfare and management of wild animals.

Classification of behaviours from acceleration data can be achieved manually, through observing animals and attributing acceleration signals to different behaviours undertaken (Wilson et al., 2006). Decision trees then utilise a series of questions to categorise the data with respect to the observed signal criteria (McClune et al., 2014; Riaboff et al., 2019; Valletta et al., 2017).
Although decision trees can be accurate and effective, they are time-consuming to construct and use, especially when animals are monitored for long periods of time and undertake many different behaviours (Hammond et al., 2016). Increasingly, machine learning is being used to automate behaviour recognition, either through unsupervised or supervised methods. Unsupervised machine learning groups acceleration signals into likely behaviour categories by identifying similarities in patterns. More commonly, supervised machine-learning methods, such as random forest (RF) models, are trained using previously classified accelerometer data and are then used to predict animal behaviours using distinct accelerometer attributes (Breiman, 2001). These methods can rapidly and accurately classify behaviours from vast datasets collected from animals in the wild, where observation is not always possible.
Accelerometer data calibrated via observations forms a behaviour 'training' dataset (Shuert et al., 2018; Wang, 2019). RF models generate multiple (e.g. 300+) decision trees, and the most frequent predicted classification from the many individual trees generated is selected as the predicted behaviour for each time period (Li, 2013). Training datasets are generated from a proportion of the calibrated data (60%-80%), and the resulting models can be tested for predictive accuracy using the remaining test data (Lush et al., 2016; Venter et al., 2019). Validation using data that was not initially used to train the model provides an independent measure of predictive accuracy.
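The voting step described above can be illustrated with a minimal sketch. It is not the randomForest implementation: the function name `majority_vote` and the toy votes are hypothetical, and ties are simply resolved in favour of the label encountered first, which is an assumption.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees.

    `tree_predictions` holds one predicted behaviour per tree for a single
    time period. Ties go to the label seen first (an assumption of this sketch).
    """
    if not tree_predictions:
        raise ValueError("need at least one tree prediction")
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label


# Hypothetical votes from five trees for one time period.
votes = ["walk", "walk", "rest", "trot", "walk"]
predicted = majority_vote(votes)
```

In a real forest the same vote is taken over several hundred trees for every time period in the data.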
Overall, decision trees can be highly accurate; however, they are prone to overfitting behavioural categories, that is, they are highly accurate at identifying training data but less so for unidentified data (Valletta et al., 2017). Automated RF models address this problem by generating multiple decision trees, each from a subset of the available variables and a subset of the classified data, so are less subject to overfitting and have an increased accuracy (Cutler et al., 2007; Nathan et al., 2012; Valletta et al., 2017). However, inherent errors with RF modelling can occur, such as incorrectly identifying or overlooking certain behaviours (Rast et al., 2020; Wang et al., 2015). Indeed, the accuracy of RF modelling has been reported to be as low as 0% for mountain lion (Puma concolor) behaviours such as grooming, while their locomotory behaviours were identified with an accuracy above 90% (Wang et al., 2015). Graf et al. (2015) hypothesised that the erratic nature of grooming, which requires many postures and is conducted at varying frequencies, meant it was difficult to define using accelerometer metrics and hence was often misidentified by RF models. Revising methods to improve predictive accuracy is an important component of data processing that is often overlooked in ecological studies, and doing so would benefit researchers by improving model outputs.
Three main approaches have been described for changing or improving the efficacy of RF modelling, all of which are implemented during acceleration data processing before the RF models are fitted (Alvarenga et al., 2016; Pagano et al., 2017; Tatler et al., 2018). They are (i) increasing the number of calculated variables that improve the explanatory power and specificity in describing behaviours (Tatler et al., 2018; Wijers et al., 2018), (ii) increasing or decreasing the frequency of acceleration data recording (Fogarty et al., 2020; Wang et al., 2015) and (iii) ensuring that the training data incorporates a similar duration of each of the behaviours (here denoted 'standardised duration'; Chen et al., 2004; Pagano et al., 2017; Wijers et al., 2018).

KEYWORDS
activity budget, biologging, domestic cat, machine learning, predictive accuracy, random forest model

TAXONOMY CLASSIFICATION
Applied ecology, Behavioural ecology, Conservation ecology, Movement ecology, Zoology

| Choice of calculated variables
The variables calculated from accelerometer data that are used to generate an RF model can affect overall model accuracy (Tatler et al., 2018; Wijers et al., 2018). Many studies simply select commonly used variables, but do not investigate whether these generate the most accurate model (Fogarty et al., 2020; Venter et al., 2019). Variables typically consist of static and dynamic acceleration (Smith, 1997; Wilson et al., 2006), dynamic body acceleration (DBA) (Qasem et al., 2012; Wilson et al., 2020) and pitch and roll (Fehlmann et al., 2017; Nathan et al., 2012; Wilson et al., 2008).
Potential extra variables might include the dominant power spectrum frequency and amplitude, and ratios of Vectoral Dynamic Body Acceleration (VeDBA) to dynamic acceleration (Fehlmann et al., 2017; Lush et al., 2018; Wang et al., 2015), to name just a few.
While some metrics provide an instantaneous measurement of motion in one or up to three axes, the running standard error of any waveform indicates its amplitude and therefore the 'size' of the acceleration movement over time of a particular behaviour, which can therefore also be important in behaviour classifications (Laich et al., 2008; Nathan et al., 2012; Qasem et al., 2012; Smith, 1997).
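A running standard error of the kind described here might be computed as in the sketch below. The details are assumptions made for illustration: the window is centred on each sample and truncated at the ends of the series, and the standard error is taken as the sample standard deviation divided by the square root of the number of samples in the window. The function name `running_standard_error` and the toy signals are hypothetical.

```python
import math
import statistics


def running_standard_error(values, window):
    """Standard error (sample SD / sqrt(n)) in a centred running window.

    `window` is given in samples, e.g. 80 for 2 s of 40 Hz data.
    Windows are truncated at the start and end of the series (an assumption).
    """
    if window < 2:
        raise ValueError("window must span at least two samples")
    half = window // 2
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        if len(chunk) < 2:
            result.append(0.0)  # a single sample has no spread
        else:
            result.append(statistics.stdev(chunk) / math.sqrt(len(chunk)))
    return result


# A flat signal (no movement) versus an oscillating one (rhythmic movement).
se_flat = running_standard_error([0.0] * 200, window=80)
se_oscillating = running_standard_error([(-1.0) ** i for i in range(200)], window=80)
```

The flat signal gives a running standard error of zero throughout, whereas the oscillating signal gives a steadily positive value even though its raw values keep changing sign.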

| Adjustment of accelerometer data frequency
Accelerometer data, while usually recorded at sub-second sampling frequency (up to 140 Hz, Sur et al., 2017), are often summed or expressed as a mean over 1 or 2 s to provide summary metrics of movements (Lush et al., 2018; Pagano et al., 2017; Shepard, Wilson, Halsey, et al., 2008; Wijers et al., 2018). The use of these lower-resolution recordings facilitates rapid processing of accelerometer data and can be an important consideration given computational power, battery life and the study duration and aims. However, higher sampling frequencies could provide more precise information for fast-paced or high-speed behaviours such as running (Chakravarty et al., 2019).
Alternatively, aperiodic, or 'slower' behaviours such as feeding may, in fact, be represented better by an average over a few seconds (Alvarenga et al., 2016; Lush et al., 2018). Therefore, the inclusion of data recorded at different frequencies (via sub-sampling or as a mean over time) has the potential to affect the accuracy and reliability of the RF model with which to predict behaviours (Alvarenga et al., 2016; Hounslow et al., 2019; Lush et al., 2018).
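Expressing high-frequency data as a mean over each second can be sketched as follows. This assumes non-overlapping one-second blocks and drops any trailing partial second; the function name `mean_per_second` is hypothetical.

```python
def mean_per_second(samples, hz=40):
    """Collapse samples recorded at `hz` into one mean value per whole second.

    Any trailing partial second is dropped (an assumption of this sketch).
    """
    if hz < 1:
        raise ValueError("hz must be a positive integer")
    n_seconds = len(samples) // hz
    return [
        sum(samples[s * hz:(s + 1) * hz]) / hz
        for s in range(n_seconds)
    ]


# Two seconds of toy 40 Hz data: 1.0 throughout the first second, 3.0 in the second.
one_hz = mean_per_second([1.0] * 40 + [3.0] * 40, hz=40)
```

The same function with a larger block (for example `hz * 2` samples) would give the mean over 2 s mentioned above.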

| Standardised durations: balancing the duration of each behaviour in the training dataset
There is some evidence that RF models trained using datasets that contain more examples of some behaviours than others (i.e. they use every behaviour example collected and therefore have an 'inconsistent' duration of each in the dataset, e.g. an abundance of 'resting' behaviour) skew predictions in favour of the more abundant behaviour classification while less readily predicting infrequent behaviours (Chen et al., 2004; Smit et al., 2023). Behaviours that are hard to observe during calibrations, such as mating, may therefore be misclassified during wild animal behaviour predictions. This potential bias can be minimised by subsampling abundant behaviours to generate a more 'standardised' duration distribution of behaviours in the training dataset (Pagano et al., 2017; Wijers et al., 2018).
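Subsampling abundant behaviours to a common maximum duration could look like the sketch below. It assumes that individual rows are sampled at random within each behaviour (the source does not say whether rows or contiguous bouts were sampled), and the function name `cap_duration`, the row layout and the default cap of 60 s are illustrative choices.

```python
import random
from collections import defaultdict


def cap_duration(rows, max_seconds=60, hz=1, seed=0):
    """Keep at most `max_seconds` of data per behaviour.

    `rows` is a list of (label, features) tuples recorded at `hz` rows per
    second. Behaviours with fewer rows than the cap are kept in full.
    """
    cap = max_seconds * hz
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[0]].append(row)

    rng = random.Random(seed)  # fixed seed so the subsample is repeatable
    balanced = []
    for group in by_label.values():
        if len(group) > cap:
            group = rng.sample(group, cap)  # random subsample without replacement
        balanced.extend(group)
    return balanced


# Toy 1 Hz dataset: 2000 s of 'rest' but only 25 s of 'groom'.
toy_rows = [("rest", i) for i in range(2000)] + [("groom", i) for i in range(25)]
balanced_rows = cap_duration(toy_rows, max_seconds=60, hz=1)
```

After capping, 'rest' contributes 60 rows and 'groom' keeps all 25, so the abundant behaviour no longer dominates the training data.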
This study aimed to examine how effective various RF models were at identifying behaviours when different aspects of the training data [(i) to (iii) above] were changed. These models were used to identify the behaviours of a model quadruped: free-ranging domestic cats (Felis catus, hereafter 'cats'). Cat behaviours were also manually identified using a decision tree to assess whether the RF models reliably identified the behaviours of free-ranging animals. Cats were studied as they are a useful proxy for wild animal movement and behaviour research, in part because they are readily handled, which facilitates device deployment, but also because they roam freely outdoors, replicating behaviours that might occur in wild cryptic terrestrial species. Furthermore, while accelerometers have been used to study cat activity previously (Andrews et al., 2015; Lascelles et al., 2008; Naik et al., 2018; Thomas et al., 2017), and some have identified cat behaviours from accelerometers (Kestler & Wilson, 2015; Watanabe et al., 2005; Watanabe & Takahashi, 2013), this research develops the use of RF models to efficiently and accurately process accelerometer data and identifies free-ranging domestic cat behaviours in detail. We aim to provide a framework for other researchers using RF models for behaviour identification to improve model accuracy and generate reliable activity classifications.

| Animals and study sites
Nine adult domestic cats (4 females, 5 males; aged 6 months-8 years) which were housed inside ('indoor cats') at Mid Antrim Animal Sanctuary, Antrim, Northern Ireland, were collared and filmed to calibrate behaviours. Subsequently, five domestically owned cats (4 females, 1 male; aged 9 months-12 years, 'outdoor cats', see Table A1) that were free to roam outside their owners' houses were recruited in Northern Ireland and collared to identify their natural behaviours (see below and Appendix A for details).

| Calibration of animal behaviours and accelerometer signals
Indoor cats were fitted with neck collars to which tri-axial accelerometers ('Daily Diary': Wilson et al., 2008) recording at 40 Hz were affixed. Accelerometer data were synchronised with video footage of the cats and distinct behaviours were labelled ('rest', 'walk', 'trot', 'run', 'collar shake', 'feed' and 'groom').

| Development of a decision tree for behaviour identification
A decision tree for identifying behaviours from the accelerometer data was developed from the calibrated accelerometer signals. This was accomplished by an observer examining metrics derived from the examples of calibrated behaviour data. Distinguishing features were identified which were indicative of different movements, for example, a high VeDBA (sensu Qasem et al., 2012), changes in pitch, or patterns in the amplitude and frequency of the dynamic acceleration (see the decision tree Figure A1). The decision tree accuracy was tested by the observer using it to identify the calibrated samples of behaviours and calculate the percent that was correctly identified (Table A2).

| Generating the datasets for RF modelling
From the labelled, video-calibrated accelerometer data, a 'base' dataset of variables was calculated at 40 Hz. This included 13 variables: raw ('acc'), static ('st') and dynamic ('dy') acceleration for all three axes, namely lateral (sway), vertical (heave) and sagittal (surge) (x, y and z, respectively), together with vectoral dynamic body acceleration (VeDBA), smoothed VeDBA ('VeDBAs') over 2 s, 'Pitch' and 'Roll' (definitions and equations for these variables are given in Appendix A and Table A3). A second 'extended' dataset at 40 Hz was generated by calculating eight further variables; the data from each behaviour were grouped and a running 2 s standard error of the variables was calculated (Table A4). Two further datasets were generated by calculating the mean values over 1 s for all the variables in that dataset, generating a base and an extended dataset at 1 Hz. Four more 'standardised duration' datasets were then derived from these by randomly subsampling the data to consist of a maximum of 60 s of each behaviour (rather than, e.g., over 2000 s of 'rest' behaviour) (Pagano et al., 2017). A time period of 60 s was chosen as most behaviours were recorded for at least this amount of time (Table A2), and this time period provided a large enough dataset to train and validate the models. Where less than 60 s of a certain behaviour occurred, 100% of these data were included in the analysis. These calculations generated eight training datasets (Figure 1) that were used to fit RF models for the identification of domestic cat behaviours.
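The equations for these variables are given in Appendix A and Table A3, which are not reproduced here, so the sketch below rests on commonly used definitions rather than the authors' exact ones. It assumes that static acceleration is a running mean of the raw signal over 2 s, that dynamic acceleration is raw minus static, and that VeDBA is the vector norm of the three dynamic components. The function names and toy signals are hypothetical.

```python
import math


def running_mean(values, window):
    """Centred running mean, truncated at the ends of the series."""
    half = window // 2
    means = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        means.append(sum(chunk) / len(chunk))
    return means


def vedba_series(acc_x, acc_y, acc_z, hz=40, static_window_s=2):
    """VeDBA per sample from three raw acceleration axes (values in g).

    Assumed definitions: static = running mean over `static_window_s` seconds,
    dynamic = raw - static, VeDBA = sqrt(dx^2 + dy^2 + dz^2).
    """
    window = hz * static_window_s
    dynamic_axes = []
    for axis in (acc_x, acc_y, acc_z):
        static = running_mean(axis, window)
        dynamic_axes.append([raw - st for raw, st in zip(axis, static)])
    return [
        math.sqrt(dx * dx + dy * dy + dz * dz)
        for dx, dy, dz in zip(*dynamic_axes)
    ]


n = 120  # 3 s of toy data at 40 Hz
# Motionless collar: gravity only on one axis, so no dynamic acceleration.
still = vedba_series([0.0] * n, [0.0] * n, [1.0] * n)
# Rapid oscillation on the x axis on top of the same gravity component.
shaking = vedba_series([0.5 if i % 2 else -0.5 for i in range(n)], [0.0] * n, [1.0] * n)
```

Under these assumptions the motionless signal yields a VeDBA of zero, while the oscillating signal yields values close to the 0.5 g amplitude of the movement.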

| Generating the RF models
FIGURE 1 Development of datasets used for random forest modelling. Base datasets consisted of 13 'base' variables including raw acceleration, static and dynamic acceleration, all in three axes, heave, surge and sway, plus VeDBA, smoothed VeDBA over 2 seconds, Pitch and Roll. 'Extended' datasets consisted of the base variables plus the standard error of raw and dynamic acceleration in all three axes, VeDBA and smoothed VeDBA. Data were collected at 40 Hz and the mean of each variable was also calculated over each second to generate datasets at 1 Hz. 'Standardised duration' datasets were derived from subsampling the 'inconsistent duration' 40 Hz and 1 Hz datasets, so each had a maximum of 60 seconds of any one behaviour, whereas 'inconsistent duration' datasets included all available behavioural samples.

Using R software (version 3.4.0, R Core Team 2014) and the package randomForest (Breiman, 2001), RF models were generated from the eight datasets using a random sample of 60% of the calibrated data. To train each model, we fit 500 classification trees and used a random subset of three predictor variables for each split in the tree (Lush et al., 2018; Pagano et al., 2017). A minimum number of five data points was used during classification regressions and 10 during predictions (Breiman, 2001). These models were then used to predict the behaviours of the remaining 40% of the data. The most frequent prediction across all trees was selected as the final classification, which was then compared to the actual, video-identified, behaviour (Breiman, 2001; Pagano et al., 2017). We calculated the 'out-of-bag' (OOB) error rate and the Gini Index for each model and evaluated the predictive accuracy of each model from the precision, recall and F-measure of each behaviour (see Appendix A 'Measuring the accuracy of RF models'). The Gini Index indicates the importance of a variable in improving the purity of behaviour classifications (Breiman, 2001; Christensen et al., 2023; Han et al., 2016). High F-measures and low OOB error rates indicate good model accuracy, but a low OOB error rate combined with a low F-measure indicates model overfitting, where the model can reliably classify data from the training dataset but not the validation dataset.
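The per-behaviour precision, recall and F-measure used to evaluate each model can be illustrated with a short sketch. The paper's own definitions sit in Appendix A, which is not reproduced here, so the code assumes the usual ones: precision is the share of predictions of a behaviour that were correct, recall is the share of true instances that were found, and the F-measure is their harmonic mean. The analysis itself was run in R; the function name `per_class_scores` and the toy labels are hypothetical.

```python
def per_class_scores(actual, predicted):
    """Return {label: (precision, recall, f_measure)} for each behaviour."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    scores = {}
    for label in sorted(set(actual) | set(predicted)):
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        scores[label] = (precision, recall, f_measure)
    return scores


# Toy validation set: one 'walk' sample is misclassified as 'rest'.
video_labels = ["rest", "rest", "walk", "walk", "groom"]
model_labels = ["rest", "rest", "walk", "rest", "groom"]
scores = per_class_scores(video_labels, model_labels)
```

In this toy case the single error lowers the precision of 'rest' and the recall of 'walk', which shows why both measures are reported alongside the combined F-measure.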

| Free-ranging cat behaviour identification
The five outdoor cats were fitted with collars bearing the same accelerometers ('Daily Diary', Wilson et al., 2008) set to record at 40 Hz (see Appendix A for details).Devices were fitted to hang under the chin of the cats and recorded for a total of 13.72 days (mean 2.74 ± 0.60 days per individual).

| Identification of free-ranging cat behaviours via decision tree and RF models
The free-ranging cat behaviours were first identified manually by a researcher examining the accelerometer data. Using the decision tree developed from the categorised data, they classified the behaviours of the first 15 min of each hour for all five cats, totalling 74.88 h of identified behaviours (mean 15.00 ± 3.61 h per cat).
This was representative of the behaviours exhibited by the cats for the whole time they were collared (see Appendix A 'Effects of identifying cat behaviours for 15 min per hour or the full time').This method provided an accurate measure of the time cats spent engaged in the behaviours as a reference for comparison with the RF modelling.
Second, the behaviours of the free-ranging cats were identified from their accelerometer data using the eight RF models developed from the training datasets, using the package randomForest (Breiman, 2001). To achieve this, their accelerometery data were used to calculate the same variables as those used to train the RF models, for example, the base variables were included when the RF models had been developed from base datasets (Table A4). The free-ranging cat accelerometer variables were also calculated at either 40 Hz or using mean values over 1 second in the same way as the calibrated training data. The RF models were used to identify the behaviours at each instant in time (40 Hz or 1 Hz) using the 500 trees of each model, with the most common outcome across the trees selected as the predicted behaviour. The total amount of time the cat spent on each behaviour was then summed. The time spent undertaking each behaviour was converted to a per cent of the time that the particular individual was collared.
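Turning per-sample predictions into a per cent of time for each behaviour is a simple tally, sketched below. Because every prediction covers the same length of time (1/40 s or 1 s), counting predictions is equivalent to summing durations. The function name `percent_time` and the toy predictions are hypothetical.

```python
from collections import Counter


def percent_time(predictions):
    """Per cent of all predicted time steps assigned to each behaviour."""
    if not predictions:
        raise ValueError("no predictions supplied")
    counts = Counter(predictions)
    total = len(predictions)
    return {label: 100.0 * n / total for label, n in counts.items()}


# Toy record of 100 one-second predictions for a single cat.
toy_predictions = ["rest"] * 90 + ["walk"] * 6 + ["groom"] * 4
budget = percent_time(toy_predictions)
```

The resulting percentages sum to 100 for each individual, which makes activity budgets comparable between cats collared for different lengths of time.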

| Data analyses
Analyses were conducted using R (version 3.4.0, R Core Team 2014), with a statistical significance level of p < .05. Results are expressed as mean ± 1 standard error unless otherwise indicated.
An intraclass correlation coefficient (ICC) was calculated with the DescTools package (Signorell, 2016) based on a single rating, absolute-agreement, two-way mixed effects model (Koo & Li, 2016) to compare the per cent of time cats spent on the behaviours predicted by the RF model with the per cent of time spent on the behaviours identified from the decision tree. The decision tree predictions of the behaviours were assumed to be the most precise method of behaviour identification as each behaviour signal could be compared to other examples of calibrated signals.
The ICC model assessed the reliability of the two methods (the decision tree and one RF model in each case) for providing similar results in terms of behaviour frequency and rank. If the 95% confidence intervals of the ICC estimate were greater than 0.9, between 0.9 and 0.75, between 0.75 and 0.5 and less than 0.5, this was indicative of 'excellent', 'good', 'moderate' and 'poor' reliability, respectively (Koo & Li, 2016). In the first instance, all behaviours were included in this analysis before 'rest' behaviours were removed and the comparisons re-run.
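The single-rating, absolute-agreement ICC and the reliability bands can be sketched as follows. This is not the DescTools routine: it computes only the point estimate, from the standard two-way mean squares, and does not produce the 95% confidence interval on which the classification above is based. The function names, the handling of values falling exactly on a boundary and the toy data are assumptions.

```python
def icc_absolute_single(ratings):
    """ICC for single rating, absolute agreement, two-way model.

    `ratings` has one row per subject (here, a behaviour) and one column per
    method. Uses (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n).
    """
    n, k = len(ratings), len(ratings[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two subjects and two methods")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def reliability_label(value):
    """Reliability band for an ICC value (or for one bound of its 95% CI)."""
    if value > 0.9:
        return "excellent"
    if value >= 0.75:
        return "good"
    if value >= 0.5:
        return "moderate"
    return "poor"


# Toy per cent of time for three behaviours: [decision tree, RF model].
identical = icc_absolute_single([[1.0, 1.0], [5.0, 5.0], [9.0, 9.0]])
offset = icc_absolute_single([[1.0, 2.0], [5.0, 6.0], [9.0, 10.0]])
```

Because the measure is one of absolute agreement, a constant offset between the two methods lowers the coefficient even when the behaviours are ranked identically.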

| RF model accuracy for calibrated behaviours of indoor cats
The RF model that most accurately predicted known behaviours used the extended variables, with standardised duration of behaviours, at 40 Hz. In this model, the F-measure was 0.96 ± 0.02 (Table 1) and the precision and recall were both above 0.95. The second most accurate model, with extended variables, inconsistent behaviour durations, at 40 Hz, had an F-measure of 0.94 ± 0.05 and a precision and recall above 0.93. The accuracy of the RF models was lower when the mean of each second was calculated for the variables. The most accurate model, when using the mean over 1 second, was developed from the extended variables, with a standardised duration of behaviours. This had an F-measure of 0.74 ± 0.05 and a precision and recall of 0.83 and 0.71 respectively. Thus, all datasets at 40 Hz generated more accurate models than those at 1 Hz, according to the F-measure and the OOB error rate. In addition, the datasets with standardised durations of behaviours produced the models with the highest F-measure for datasets at both 40 and 1 Hz.
The OOB error rate was higher for models with standardised durations of behaviours than for the corresponding models (same variables and frequency) with inconsistent durations. While a low OOB error rate combined with a low F-measure can indicate model overfitting, the high F-measures and higher OOB error rates seen here suggest the models with standardised durations of behaviours are less prone to overfitting than those with inconsistent durations of behaviours.
Prediction accuracy varied with behaviour. Using the most accurate model (with extended variables, standardised duration of behaviours, at 40 Hz), trot, run, shake, rest, feed and groom were all identified with an F-measure above 0.92 but walk had an F-measure of 0.88 (Table 2). The most accurate model at 1 Hz (with extended variables and standardised duration of behaviours) had more varying accuracy with different behaviours, most accurately predicting shake, feed, rest and run (F-measures all over 0.8) but less accurately predicting groom (0.67), walk (0.58) and trot (0.58). In general, high-frequency, fast-paced behaviours (walk, trot, run and shake) were most accurately identified by models derived from the high-frequency 40 Hz datasets. Across all 40 Hz models, high-frequency behaviours were identified at an average F-measure of 0.89 ± 0.14, whereas with models at 1 Hz, higher frequency behaviours were identified at an average F-measure of 0.59 ± 0.24. Models derived from datasets at 1 Hz performed better at predicting low-frequency behaviours (feed, groom, rest) than at predicting high-frequency behaviours, with an average F-measure of 0.71 ± 0.25.
Based on the ICC estimate for all behaviours, there was excellent reliability between the time spent on each behaviour that was identified by the decision tree and the RF models (range = 0.999-0.999). We note, though, that the high proportion of identified 'resting' behaviour could have skewed the results towards this extremely high reliability as it comprised over 90% of the cats' behaviour. The reliability of the models decreased when 'resting' behaviour was removed from the analysis (detailed below) and likely more accurately established how reliable the models were at identifying behaviours other than 'resting'. The two models with the highest degree of reliability were both derived from extended datasets with standardised duration of behaviours; this model at 40 Hz was the most reliable and had 'good reliability' (ICC of 0.756 ± 0.006), and this model at 1 Hz had 'moderate to good reliability' (ICC of 0.751 ± 0.006). These two models predicted different amounts of time the free-ranging cats spent 'walking', 'feeding' and 'grooming' (Figure 2), where the 1 Hz model slightly overestimated the amount of time spent 'walking' compared to the decision tree estimate but predicted 'feeding' and 'grooming' more accurately than the 40 Hz model. Notably, the 40 Hz model predicted hardly any 'feeding' or 'grooming' behaviours (<0.04% of the time, Figure 2), and is likely therefore unfit for use to identify free-roaming cat behaviours, despite its accuracy in predicting the behaviours in validations. Two of the remaining models, one with base variables, standardised duration of behaviours at 40 Hz and one with extended variables, inconsistent durations of behaviours at 1 Hz, had 'moderate reliability' (ICC between 0.641 and 0.748) compared to the decision tree-identified behaviours. The remaining four RF models had 'poor reliability'; these models had ICC values of less than 0.5 (see Table A5 for all ICC values and 95% CIs) (Koo & Li, 2016).

| Important variables for differentiating behaviours
The variables that were most important for improving the purity of behaviour predictions were similar in the two models that were most accurate at identifying free-ranging cat behaviours, both with extended variables, standardised durations of behaviours at 40 or 1 Hz. In fact, the top six variables were the same for both models, although in a different order (Figure 3), and at least six of the top 10 metrics were standard error variables and included the standard error of dynamic acceleration in all three axes. Both models also indicated that the dynamic acceleration of all three axes was the least important variable for improving node purity. The most important variables for the best model, at 40 Hz, were smoothed VeDBA, the standard error of the dynamic acceleration in the sway (X) axis, and then the standard error of VeDBA.

| DISCUSSION
Identifying animal behaviours from accelerometery allows researchers to monitor cryptic species and study behaviours over a time span ranging from seconds to years (Nuijten et al., 2020; Wang et al., 2015; Wijers et al., 2018). Manual classification of long-term studies of free-ranging animals' behaviours can, however, be labour intensive (Hammond et al., 2016). Therefore, there has been increased interest in using supervised machine-learning methods, such as RF modelling, that can increase the efficiency and accuracy of behaviour identifications. Model accuracy can vary substantially according to the species studied and the details of the methodology.
RF models have been used to predict behaviours of a diverse range of species such as griffon vultures (Gyps fulvus) (Nathan et al., 2012), polar bears (Ursus maritimus) (Pagano et al., 2017)  data from wild animals that can be used to train the models (Gooden et al., 2024;Pagano et al., 2017).This, alongside adjustments to data pre-processing, should increase the accuracy of RF model behaviour predictions and has wide-ranging implications for many aspects of ecological research and conservation.

FIGURE 2
Mean and standard error of five free-ranging domestic cats' per cent time spent on behaviours. Behaviours were identified from accelerometery data via a decision tree, and by random forest (RF) models derived from training datasets calibrated to behaviours via videoed accelerometery data of indoor cats. Definitions of each of the datasets used to develop the RF models can be found in Table A4.
The time (per cent of the day) cats spent on behaviours predicted by each model is shown by colour (see Behaviour key). 'Resting' (not shown) made the total time up to 100%. The model predictions were compared to the decision tree predictions through an intraclass correlation coefficient (see the Statistics section for details) and good (**) and moderate (*) reliability is highlighted. The model that derived behaviours most similar to the behaviours identified using the decision tree was derived from extended variables, standardised durations of behaviours at 40 Hz.

| Effect of calculating standard error variables on model accuracy
The extended RF models derived using standard error variables had a higher accuracy than those with base variables (Table 1), demonstrating that variable selections should be critically considered to improve model accuracy. There are almost limitless variables that can be calculated, and indeed, studies have included between 8 and 128 variables in their models (Graf et al., 2015; Wijers et al., 2018), which have been further enhanced by other data, such as sound (Wijers et al., 2018), or multiple synchronised accelerometers in different locations (Tran et al., 2021).
Smit et al. (2023) showed greater RF accuracy in identifying domestic cat behaviours when accelerometers were attached to a harness rather than a collar; however, harnesses can hinder movements or more easily become entangled if deployed in the wild.
The selection and importance of different variables may depend on the species and its behavioural characteristics or the behaviour of interest (Hathaway et al., 2023), as well as the computing power available, because more variables require more processing power.
Furthermore, we predict that if certain variables are demonstrably useful for a given species, these provide a good starting point for work on comparable species of different sizes or those that have similar locomotor modes, as seen in the similarity between useful predictor variables from RF models for pygmy goat (Capra aegagrus hircus) and Alpine ibex (Capra ibex) behaviours (Dickinson et al., 2021).
The high decrease in Gini found for standard error variables in the two most reliable models when classifying free-ranging cat behaviours demonstrates that these are particularly useful for increasing the purity of behaviour differentiation (Figure A2). This concurs with Nathan et al. (2012) who note the usefulness of the standard deviation to identify griffon vulture (Gyps fulvus) behaviours. A running standard error calculated over an appropriate period provides a more constant measure of the overall size of the motion and represents the amplitude of the wave that will be consistently high for a high-energy movement (Laich et al., 2008; Nathan et al., 2012) (Figure A2). Interestingly, and likely importantly, the dynamic acceleration in the heave, surge and sway axes were consistently ranked as the three least important variables. This could be due to the wavelike form of dynamic acceleration that contains peaks and troughs that occur with each step, giving a value that can be both positive and negative with appreciable variability over time (Laich et al., 2008).
This inconsistency in the dynamic acceleration appears to hinder its use as a distinguishing factor between behaviours.

| Effects of data frequency on model accuracy
Many studies identify behaviours from accelerometery data having taken a mean over 1 or 2 s (Fehlmann et al., 2017; Graf et al., 2015; Pagano et al., 2017) and Shepard, Wilson, Halsey, et al. (2008) suggest that variables should be 'smoothed' (i.e. taking a running mean) over a time period of one stroke cycle. Other studies have used smoothing periods of 3, 5 or 10 s (Campera et al., 2019; Chimienti et al., 2016; Lush et al., 2018) with varying effects on model predic-  A2.
this detail. In contrast, the 1 Hz version of the same model had a low F-measure but a good ICC reliability and provided a more accurate estimate of the time free-ranging cats spent on the stationary behaviours, 'feed' and 'groom'. We hypothesise that taking the mean over 1 s (i.e. 1 Hz data) allowed a more accurate determination of stationary behaviours because this more accurately captures the motion of behaviours that are performed at a slower frequency. Slower or 'aperiodic' behaviours such as 'grooming' may be harder to identify from just a few points in the 40 Hz dataset due to the inconsistent nature of this behaviour (as noted by Graf et al., 2015 for Eurasian beavers, and Chakravarty et al., 2019). It may be indicative of the variety of grooming motion frequencies and postures adopted by cats to groom their whole body and, while these variations can be visually identified by the researcher using a decision tree, the RF models struggled to deal with the inconsistency in this behaviour. The period over which the mean is taken should also be considered, especially for larger animals that might have a slower stride frequency; for example, Alvarenga et al. (2016) found for sheep that a mean calculated over 5 or 10 s led to a higher accuracy than over 3 s. Supporting this hypothesis, European pied flycatchers (Ficedula hypoleuca) catching prey at high speeds required a frequency of over 100 Hz for accurate identification whereas slower flight required 12.5 Hz (using the 'rabc' behaviour classification R package; Yu et al., 2023). Despite these behavioural considerations, study logistics including battery life will also influence decisions on the frequency of data collection.
Certainly, our work indicates that the frequency of the data should be carefully evaluated when using RF modelling to identify specific animal behaviours accurately. Taking a mean over 1 or 2 s would be particularly useful for identifying aperiodic behaviours, but the animal species and the frequency of the focal behaviour should be considered and data processing conducted accordingly.

Effects of standardised durations of behaviours on model accuracy
An inconsistent duration of behaviours in the training dataset has been shown to bias model predictions towards the most abundant behaviours (Chen et al., 2004; Pagano et al., 2017) and, while every effort was made to record as many samples as possible of each cat behaviour, there was an abundance of 'resting' behaviour and relatively few examples of 'groom' and 'shake' behaviours in the data.
These small sample sizes for specific behaviours did not appear to be a factor in behaviour identification accuracy; that is, these behaviours were not identified with any less precision or recall than other behaviours (Table A6). However, we did find that the models from datasets with standardised durations of each behaviour were more accurate than those with inconsistent durations of behaviours, which contrasts with the findings of Pagano et al. (2017), who found that uneven datasets were more accurate for polar bear behaviour identification. While the higher OOB error rate and F-measure seen for our models with standardised durations of each behaviour indicate a smaller chance of overfitting, this could also be due to the smaller data sample for these models; the OOB error rate is the percentage of incorrect classifications of the training data not used in each decision tree, so each 'wrong' classification had more effect. Nevertheless, there was good evidence that a standardised duration of behaviours increased model accuracy, so sub-sampling over-abundant behaviours to create a more even distribution does seem to be important in improving the predictive capabilities of RF modelling. Interestingly, the dataset size did not appear to influence overall accuracy scores; further testing of a 40 Hz dataset that was subsampled to a similar number of data points as the 1 Hz dataset (both with extended variables and standardised distributions of behaviours) showed that the 40 Hz dataset maintained a higher F-measure (see Appendix A). This demonstrates that the absolute number of samples in the smaller 1 Hz dataset was not the driving factor in the lower F-measures or OOB error rates.

CONCLUSIONS
RF models can be used to accurately predict animal behaviours using classified accelerometer data, but model accuracy can be improved via post-data-collection processing. Here, we show that high data frequencies, standardised durations of behaviours and extended variables improved model accuracy. The accuracy of models when identifying aperiodic behaviours, such as feeding and grooming, of animals in the wild may improve when using lower frequency data (means over 1 s), which suggests that the aperiodicity of focal behaviours should be taken into consideration when using RF modelling for identifying free-ranging animal behaviours. The validation of behaviour predictions with known free-ranging animal behaviours was important to reveal this trend and validations should also be

ACKNOWLEDGEMENTS
The authors wish to thank E. Cox and S. Loca for their assistance with data collection and Mid-Antrim Animal Sanctuary for their support and participation in this work. We also wish to thank all the owners of the cats that participated in this study.
1% of the cats' body weight). Daily Diary loggers were fitted under the chin of the cat in line with the lateral (sway), sagittal (surge) and vertical (heave) body axes (Chakravarty et al., 2019). While wearing the collars, cats were filmed using a Sony Alpha a58 DSLR camera (Sony, Latin America, Inc.) for 15 min during the morning while a second researcher encouraged the cat to undertake different behaviours, such as running after a toy, or provided food to observe feeding behaviours. In addition, naturally occurring behaviours were observed, such as walking, trotting, resting, grooming and shaking the collar. These behaviours were selected as they account for much of a cat's (and equivalent wild predators') daily behaviours (Wilmers et al., 2017), are of ecological significance (Williams et al., 2014), and the repertoire can be indicative of welfare (Fuller et al., 2019). Each cat was filmed for 15 min to record the different behaviours it undertook.

Accelerometer data and video synchronisation
Accelerometer data and video footage from the indoor cats were synchronised using the timestamp of the data and video. To guard against any potential inaccuracies of their internal clocks, during the video, the collar was shaken up and down by the observers to create a distinct marking point in the accelerometer data that could be synchronised with the camera timestamp on the recording.
Once the data were downloaded and loaded into DDMT software (Wildbyte technologies, http://wildbytetechnologies.com/software.html, Wilson et al., 2008), any offset that was required between the camera and the accelerometer was added to the accelerometer data.
Distinct behaviours that lasted at least two seconds were selected on the video and identified in the accelerometer data via the corrected timestamp. Transitions between behaviours were not included in any behaviour sample. DDMT is specialised accelerometer-handling software that facilitates the 'labelling' of behaviours, which could then be extracted individually. This was conducted for all distinct identifiable behaviours within the video footage.
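As an illustration of the synchronisation and labelling steps described above, the following Python sketch shifts accelerometer timestamps by a clock offset and assigns a behaviour label to each sample from video-derived bouts, discarding bouts shorter than 2 s. It is not the DDMT workflow; the `Bout` class, the `label_samples` function, the sign convention of the offset and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bout:
    behaviour: str
    start_s: float  # video time at which the behaviour starts
    end_s: float    # video time at which the behaviour ends


def label_samples(sample_times_s, bouts, clock_offset_s=0.0, min_duration_s=2.0):
    """Return one behaviour label (or None) per accelerometer sample."""
    # Shift accelerometer timestamps onto the video clock (assumed convention).
    corrected = [t + clock_offset_s for t in sample_times_s]
    # Keep only bouts that last at least the minimum duration.
    kept = [b for b in bouts if b.end_s - b.start_s >= min_duration_s]
    labels = []
    for t in corrected:
        label = None
        for b in kept:
            if b.start_s <= t < b.end_s:
                label = b.behaviour
                break
        labels.append(label)
    return labels


# Hypothetical example: 6 s of 40 Hz samples, logger clock 0.5 s behind the video.
times = [i / 40 for i in range(240)]
bouts = [Bout("walk", 1.0, 3.5), Bout("shake", 4.0, 5.0)]
labels = label_samples(times, bouts, clock_offset_s=0.5)
```

Samples that fall outside any retained bout, including transitions between behaviours, receive no label in this sketch and would be left out of a training dataset.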

Random forest model generation
Random forest models use a subset of known behaviour data to 'train' the model to identify behaviours and use the remaining data subset to 'test' the model accuracy. Classification trees were built using a
focal behaviours were predicted as negative and returned N/A results for precision, and F-measure. If the precision was 0 (when TP and FP were 0), the F-measure could not be calculated and was returned as 0.
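Precision, recall and the F-measure for a focal behaviour can be computed from the counts of true positives (TP), false positives (FP) and false negatives (FN). The Python sketch below uses the standard definitions and returns `None` where a value is undefined (analogous to an N/A result). The function names and example labels are hypothetical, and the handling of undefined values is an assumption rather than a description of the analysis code used in this study.

```python
def confusion_counts(true_labels, predicted_labels, focal):
    """Count TP, FP and FN for one focal behaviour."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == focal and p == focal)
    fp = sum(1 for t, p in pairs if t != focal and p == focal)
    fn = sum(1 for t, p in pairs if t == focal and p != focal)
    return tp, fp, fn


def precision_recall_f(tp, fp, fn):
    """Standard precision, recall and F-measure; None where undefined."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    if precision is None or recall is None or (precision + recall) == 0:
        f_measure = None
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical test-set labels.
true = ["rest", "rest", "walk", "walk", "groom"]
pred = ["rest", "walk", "walk", "walk", "rest"]
walk_scores = precision_recall_f(*confusion_counts(true, pred, "walk"))
groom_scores = precision_recall_f(*confusion_counts(true, pred, "groom"))
```

In this example 'groom' is never predicted, so its precision and F-measure are undefined, which mirrors the situation in which a focal behaviour is only ever predicted as negative.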

Generating the datasets for random forest modelling
Random forest models use variables to differentiate between behaviours. These are derived from the raw accelerometer data. The variables derived from the raw heave, surge and sway values (movements in orthogonal axes) of the identified accelerometer samples describe the animal's body motion and posture through acceleration and were selected as they describe the animal's movement in different ways (Venter et al., 2019).
The variables calculated from the raw accelerometery data that constitute the base dataset were chosen based on previous accelerometery studies (Fehlmann et al., 2017; Shepard, Wilson, Halsey, et al., 2008; Shepard, Wilson, Quintana, et al., 2008; Watanabe & Takahashi, 2013; Wilson et al., 2006, 2008). This dataset can be used to train a random forest model, which can then be used to identify behaviours from other accelerometer data.
A 'base' dataset of variables was calculated at 40 Hz. This comprised 13 variables: raw acceleration ('acc'), static acceleration ('st') and dynamic acceleration ('dy'), all measured in three axes: sway, heave and surge (noted as x, y and z, respectively). The static acceleration represents the animal's posture, whereas the dynamic acceleration represents animal movements (Wilson et al., 2006). Vectorial dynamic body acceleration (VeDBA), smoothed VeDBA, 'Pitch' and 'Roll' were also calculated (definitions and equations for these variables are provided in Tables A3 and A4).
TABLE A3 Summary variables extracted from accelerometer data and used in random forest models to predict domestic cat behaviours.
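To make the relationship between the raw, static and dynamic acceleration concrete, the Python sketch below derives the static component as a running mean of the raw signal, the dynamic component as the raw signal minus the static component, and VeDBA as the vector sum of the three dynamic components. The smoothing window (about 2 s at 40 Hz) and the edge handling are assumptions, the function names are hypothetical, and pitch and roll are omitted because their equations are given only in the tables.

```python
import math


def running_mean(values, window):
    """Centred running mean using window // 2 samples either side of each
    point; the window shrinks at the ends of the series."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def derive_base_variables(acc_x, acc_y, acc_z, window=80):
    """Return (static, dynamic, vedba); static and dynamic hold x, y, z lists."""
    raw = (acc_x, acc_y, acc_z)
    # Static acceleration: smoothed raw signal (posture).
    static = [running_mean(a, window) for a in raw]
    # Dynamic acceleration: raw minus static (movement).
    dynamic = [[r - s for r, s in zip(a, st)] for a, st in zip(raw, static)]
    # VeDBA: vector sum of the three dynamic components.
    vedba = [math.sqrt(dx * dx + dy * dy + dz * dz) for dx, dy, dz in zip(*dynamic)]
    return static, dynamic, vedba


# Hypothetical 5 s of 40 Hz data: gravity on the heave axis plus a 2 Hz oscillation.
n = 200
x = [0.1 * math.sin(2 * math.pi * 2 * i / 40) for i in range(n)]
y = [1.0 + 0.3 * math.sin(2 * math.pi * 2 * i / 40) for i in range(n)]
z = [0.0] * n
static_ex, dynamic_ex, vedba_ex = derive_base_variables(x, y, z)
```

A smoothed VeDBA could be obtained by passing `vedba_ex` through `running_mean` again, although the smoothing window used for that variable is likewise an assumption here.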
A second 'extended' dataset at 40 Hz was generated by grouping the data from each behaviour and calculating a running 2-second standard error of raw and dynamic acceleration in the sway, heave and surge axes, VeDBA, and smoothed VeDBA (Table A3). These further eight variables generated an 'extended' dataset that consisted of the base variables and the standard error variables (Table A4).
These new variables examined the variation of each of the 'active' variables around the mean (Fehlmann et al., 2017;Laich et al., 2008;Watanabe & Takahashi, 2013).
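A running standard error of this kind can be sketched in Python as follows, taking the standard error as the sample standard deviation divided by the square root of the number of samples in the window. A 2 s window corresponds to 80 samples at 40 Hz; the use of a trailing (rather than centred) window, the `None` padding before the window fills and the function name are assumptions.

```python
import math
import statistics


def running_standard_error(values, window=80):
    """Standard error (sd / sqrt(n)) over a trailing window of `window`
    samples; None is returned until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
            continue
        chunk = values[i + 1 - window : i + 1]
        out.append(statistics.stdev(chunk) / math.sqrt(window))
    return out


# Hypothetical dynamic acceleration for one behaviour bout (3 s at 40 Hz).
bout = [0.2 * math.sin(2 * math.pi * 3 * i / 40) for i in range(120)]
se = running_standard_error(bout, window=80)
```

Because the standard error is calculated within each behaviour, the function would be applied separately to the samples of each labelled bout rather than to the continuous record.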
Two further datasets were generated by calculating the mean values of the 40 Hz datasets over one second for all the variables in that dataset, generating a base and an extended dataset at 1 Hz. The frequency of the data has been shown to influence the reliability of behaviour identification when random forest modelling (Alvarenga et al., 2016).
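Reducing a 40 Hz variable to 1 Hz by averaging can be sketched as a block mean over consecutive, non-overlapping groups of 40 samples. In the Python example below, dropping an incomplete final second is an assumption, and the function name and example values are hypothetical.

```python
def mean_per_second(values, hz=40):
    """Average consecutive blocks of `hz` samples; an incomplete final
    block is discarded."""
    n_full = len(values) // hz
    return [sum(values[i * hz : (i + 1) * hz]) / hz for i in range(n_full)]


# Hypothetical 2.25 s of a 40 Hz variable.
signal = [1.0] * 40 + [3.0] * 40 + [5.0] * 10
one_hz = mean_per_second(signal)
```

The same function would be applied to every variable in the base and extended datasets, within each labelled behaviour, to produce the corresponding 1 Hz datasets.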
Four more 'standardised duration' datasets were then derived, one from each of the base and extended datasets at 40 and 1 Hz. A long duration of examples of some behaviours can lead to a bias in the classification algorithm towards the more numerous behaviours (Chen et al., 2004). Therefore, a more even distribution of behaviours decreases the overestimation of the more numerous behaviours in training datasets. We used a similar method to Pagano et al. (2017), in which datasets of known behaviours were randomly subsampled to consist of a maximum of 60 s of each behaviour (rather than over 2000 s of rest behaviour, Table A2). Where less than 60 s of a certain behaviour occurred, all of these data were included in the analysis.
In total, eight datasets were developed (Figure 1).
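The standardisation of behaviour durations can be illustrated with the Python sketch below, which caps each behaviour at 60 s of data (2400 rows at 40 Hz, or 60 rows at 1 Hz) by random subsampling and keeps all rows for behaviours with less than 60 s. Sampling individual rows rather than contiguous segments is an assumption, as are the function name, the fixed seed and the example data.

```python
import random
from collections import Counter, defaultdict


def standardise_durations(rows, max_seconds=60, hz=40, seed=0):
    """rows: list of (behaviour, features) tuples. Returns a list in which
    no behaviour contributes more than max_seconds * hz rows."""
    cap = max_seconds * hz
    by_behaviour = defaultdict(list)
    for row in rows:
        by_behaviour[row[0]].append(row)
    rng = random.Random(seed)  # fixed seed so the subsample is repeatable
    out = []
    for behaviour, group in by_behaviour.items():
        if len(group) > cap:
            group = rng.sample(group, cap)  # sample without replacement
        out.extend(group)
    return out


# Hypothetical 40 Hz dataset: 125 s of 'rest' and 2.5 s of 'shake'.
data = [("rest", i) for i in range(5000)] + [("shake", i) for i in range(100)]
balanced = standardise_durations(data)
counts = Counter(behaviour for behaviour, _ in balanced)
```

With these example data, 'rest' is reduced to the cap while the rarer 'shake' rows are all retained.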

Free-ranging cat data collection
Cat owners in Northern Ireland were contacted in 2016 and volunteered their animals to have their movements recorded.For the first 2 days of the study, the free-roaming cats were fitted with 'dummy collars', which were the same size and weight as functioning collars but did not contain any devices.This allowed the cat to become accustomed to wearing the collar and the added weight of the devices.
All the collars were fitted with a quick-release clasp so that they would release if the cat became entangled. The collars were adjusted to fit each cat and allow two fingers to fit between the collar and the cat (Lord et al., 2010). Upon deployment, cats were monitored for 30 min to ensure there was no discomfort. Thereafter, owners monitored their cat's behaviour to watch for any signs of stress. In the trial, no signs of stress were observed in any of the individuals we measured, so all cats were included in the study. After two days, each dummy collar was exchanged for one that carried a VHF radio transmitter (Tabcat homing tag © 2016 Loc8tor Ltd.) and an accelerometer tag ('Daily Diary', Wilson et al., 2008). The accelerometer was set to record at 40 Hz. The VHF was only used to find the collar if it became released from the cat (which happened on one occasion, and the cat was fitted with a replacement collar the same day). The total weight of the collar and loggers was 61 g, less than 1.5% of the body weight of any of the cats. Throughout the study, cats were allowed to move freely in and out of their owner's house via either a cat flap or being let in and out when required.

Free-ranging cat behaviour identification from the decision tree
Through the validation process described in the text, the decision tree was accurate 82.76% of the time (Table A2). The behaviours

TABLE A5
The intraclass correlation coefficient estimates and their 95% confidence intervals are based on a single rating, absolute-agreement, two-way mixed effects model.

Note:
The per cent of time that domestic cats spent on each behaviour each day, as predicted by each of the random forest models, was compared with that identified from accelerometery data by an observer using an ethogram.
however, when the dataset at 40 Hz was subsampled to include the same number of lines as the 1 Hz datasets, the modelling accuracy decreased. This shows that the size of the dataset can influence model accuracy when identifying behaviours. We found that the subsampled 60-event datasets were still more accurate at identifying behaviours than the model derived from a 1 Hz dataset of similar size (although this was not tested for free-roaming cat data using the ICC reliability measure). This shows that the detail embedded in the accelerometery data recorded at 40 Hz is important when identifying behaviours and that taking a mean of the data can lose distinguishing features. This agrees with our above findings that behaviours that occur at a high frequency, such as locomotion in free-ranging cats, are more reliably identified from a model derived from a higher frequency dataset, and that if quick locomotor behaviours are the focus of a study, high frequencies would likely provide the highest accuracy.

Effects of identifying cat behaviours for 15 minutes per hour or the full time
When recording animal behaviours, the amount of time for which the animal is studied (e.g. 15 min/h or the full time) can influence the outcome of behavioural predictions (Altmann, 1974). We therefore tested whether there was a difference in the amount of time spent on the different behaviours when they were identified for the first 15 min of each hour or for the whole time the cat was collared.
To test this, one of the five cats' behaviours was identified by the observer using the decision tree for the whole time it was collared (85.33 h).
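The comparison between the two observation schedules can be sketched in Python as follows: behaviour labels (one per second in this hypothetical example) are reduced to the first 15 min of each hour, and the per cent of time spent on each behaviour is then calculated for both the subsample and the full record. The function names, the one-label-per-second format and the deliberately extreme example data are assumptions for illustration only.

```python
from collections import Counter


def first_minutes_of_each_hour(labels_per_second, minutes=15):
    """Keep only labels that fall in the first `minutes` of each hour."""
    keep = minutes * 60
    return [lab for i, lab in enumerate(labels_per_second) if i % 3600 < keep]


def percent_time(labels):
    """Per cent of the labelled time spent on each behaviour."""
    counts = Counter(labels)
    total = len(labels)
    return {behaviour: 100 * n / total for behaviour, n in counts.items()}


# Hypothetical 2 h record: 15 min of rest followed by 45 min of walking, repeated.
record = (["rest"] * 900 + ["walk"] * 2700) * 2
subsample = first_minutes_of_each_hour(record)
full_percent = percent_time(record)
subsample_percent = percent_time(subsample)
```

This artificial record is constructed so that the two schedules disagree completely, which shows the kind of bias the comparison is designed to detect; it does not represent the behaviour of the study cats.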

FIGURE A1
Decision tree for identifying free-ranging domestic cat behaviours from tri-axial acceleration, developed from manual calibrations of behaviours and accelerometer data using concurrent video recordings. 'Sleep' and 'rest' behaviours were characterised by long periods of inactivity. 'Grooming' behaviours included the cat licking its fur on all parts of its body including its back, tail and paws. 'Feeding' behaviour was solely from pellet food from a bowl. Locomotory behaviours, 'walk', 'trot' and 'run', were conducted in straight lines (with no corners) and characterised by the increasing speeds and different gaits. 'Collar shake' or scratch was typically conducted using a hind leg or a rotatory shake of the head. Unknown behaviours that were not defined but were observed included human interactions such as the cat being stroked and active behaviours such as jumping onto a high surface or playing with toys.
The intraclass correlation coefficient (ICC) was calculated with the DescTools package (Signorell, 2016) based on a single rating, absolute-agreement, two-way mixed effects model (Koo & Li, 2016) and was used to assess the reliability of the observer identifying the behaviours of one cat for the first 15 min of each hour compared to identifying the whole time the cat was collared (as a per cent of the time identified). Analyses were conducted using R (version 3.4.0, R Core Team 2014). This showed whether the predictions of cat behaviours from the shorter observation times provided an accurate estimate of cat behaviours over the whole day.
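For readers who wish to see how a single-rating, absolute-agreement, two-way ICC is obtained, the Python sketch below computes it from the mean squares of a two-way layout, ICC = (MSR - MSE) / (MSR + (k - 1) MSE + (k / n)(MSC - MSE)), where n is the number of subjects (here, behaviours) and k the number of ratings (here, observation schedules). It is an independent illustration and not the DescTools implementation; the function name and the example values are hypothetical, and confidence intervals are not computed.

```python
def icc_absolute_single(ratings):
    """ratings: list of n rows (subjects), each a list of k ratings.
    Returns the single-rating, absolute-agreement, two-way ICC."""
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-rating mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))


# Hypothetical per cent time on six behaviours: [15 min per hour, whole time].
percent_time_pairs = [
    [60.0, 58.5], [20.0, 21.0], [8.0, 8.5],
    [6.0, 6.5], [4.0, 3.5], [2.0, 2.0],
]
icc = icc_absolute_single(percent_time_pairs)
```

Values close to 1 indicate that the two schedules give nearly identical estimates of the time spent on each behaviour.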
There was 'excellent reliability' according to the ICC estimate (Koo & Li, 2016) between the time spent on each behaviour when
tive accuracy. Here, we investigated how smoothing period affected RF model accuracy by including and testing our 1 Hz datasets; however, a model derived at 40 Hz was most accurate for identifying cat behaviours during validation stages. The high-frequency behaviours, such as 'trotting' and 'running', would have rapid oscillations in the accelerometer data and the 40 Hz dataset seems to have captured
FIGURE 3 Relative importance of predictor variables for purity of domestic cat behaviour predictions based on the mean Gini index for (a) the 40 Hz and (b) 1 Hz model generated using extended variables with standardised durations of behaviours. Variable abbreviations are detailed in the methods and Table
prioritised in future studies to ensure wild animal behaviour predictions are accurate.

AUTHOR CONTRIBUTIONS
Carolyn E. Dunford: Conceptualization (lead); data curation (lead); formal analysis (lead); investigation (lead); methodology (lead); validation (lead); visualization (lead); writing - original draft (lead); writing - review and editing (equal). Nikki J. Marks: Conceptualization (equal); funding acquisition (equal); project administration (equal); supervision (equal); writing - review and editing (equal). Rory P. Wilson: Conceptualization (equal); resources (equal); supervision (equal); writing - review and editing (equal). D. Michael Scantlebury: Conceptualization (equal); funding acquisition (equal); project administration (equal); supervision (equal); writing - original draft (equal); writing - review and editing (equal).

C. Dunford was supported by a DfE studentship awarded to D. M. Scantlebury and N. J. Marks.

How to cite this article: Dunford, C. E., Marks, N. J., Wilson, R. P., & Scantlebury, D. M. (2024). Identifying animal behaviours from accelerometers: Improving predictive accuracy of machine learning by refining the variables selected, data frequency, and sample duration. Ecology and Evolution, 14, e11380. https://doi.org/10.1002/ece3.11380

housed indoors for behaviour recording
Nine adult domestic cats (4 females, 5 males; aged 6 months-8 years) housed at Mid Antrim Animal Sanctuary, Antrim, Northern Ireland in rooms (2 m × 3 m) were studied in June and July 2017. Cats were free to move to an enclosed outside area (2 m × 2 m). All individuals were either neutered or spayed and were certified as healthy by a veterinarian prior to participation in the study. Cats were fitted with quick-release collars (Breakaway buckle collar, Rogz Ltd. 2002/030628/07) to which a tri-axial accelerometer ('Daily Diary', Wilson et al., 2008) recording at 40 Hz was attached. The total weight of the collar and logger was 25 g (less than
identified by decision tree by an observer for 15 min per hour or for the whole time. The ICC estimate was 0.98 with a 95% confidence interval from 0.982 to 0.984, F(5, 5) = 59.6, p < .001, which shows that there was little difference in the time the cat was estimated to have spent on each behaviour whether the observer identified behaviours for 15 min or for the whole 60 min per hour and gives confidence to our predictions of cat behaviours.
FIGURE A2 Diagrammatic representation of a single axis of dynamic acceleration and the standard error of the dynamic acceleration during three speeds of locomotion.

inconsistent durations, 40 Hz Base variables, standardised durations, 40 Hz
Precision, recall and F-measure for random forest model testing of known cat behaviours, with the mean and standard error of the mean (SEM) for each model.

inconsistent durations, 40 Hz Extended variables, standardised durations, 40 Hz
Note: N/A values occurred if no sample of the behaviour was correctly identified.